This is a summary of essential R code for the QAC 201 course.
In this document you will be able to look up R code for basic operations, data management, graphing (ggplot), various hypothesis tests, and linear and logistic regression.
Most of the code is presented with data sets suited to the specific technique at hand, along with explanations. Please remember to customize the code (data set, personalized subsets, constructed regressions, graphs, etc.) to the needs of your assignments and research projects.
This script takes you from loading a previously saved dataframe, to examining your variables, to classes of variables, to basic indexing, and then to frequency tables. There are other little things included as well.
The dataset is either a workspace or a .Rda file. You can load a dataset in one of three ways:
1. Using syntax. The command below uses the folder structure on a PC; it will be different on a Mac.
2. Using the working directory as shown in the video (Session > Set Working Directory > browse to folder) and then opening the data file shown in the Files tab in one of the RStudio windows.
3. Navigating to the file in the folder structure outside of RStudio, opening it, and, if necessary, selecting RStudio as the program to use.
For example:
load("P:/QAC/qac201/Studies/Caterpillar/Data/Caterpillar.RData")
load("P:/QAC/qac201/Studies/MarsCrater/Data/marscrater_pds.RData")
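If you want to see what load() does without a course file, here is a minimal, self-contained round-trip (the file name and the `backup` object are made up for illustration): save() writes an .RData file, and load() restores the object under its original name.

```r
# Round-trip sketch: save a data frame to an .RData file, delete the
# object, then load() restores it under its original name.
backup <- data.frame(x = 1:3)              # stand-in for your dataset
f <- file.path(tempdir(), "backup.RData")  # hypothetical file location
save(backup, file = f)
rm(backup)                                 # backup no longer exists...
load(f)                                    # ...until load() brings it back
nrow(backup)                               # 3
```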
Since you might want to make changes to your data set in the future, it makes sense to back up your original data set:
Original_Chick <- ChickWeight
? ChickWeight
FYI, in R, there is a maximum number of rows that will be output to the screen - so at this point, with all the variables still in the file, you won’t see everything. Nor will you see all the rows when you do View(dataFrameName) because there are more observations than R will display there.
View(name_of_data_set)
For example:
View(ChickWeight)
Caveat: the View() tab does not update on its own, so please close the tab and re-run the command after changing any of the variables.
names (ChickWeight)
## [1] "weight" "Time" "Chick" "Diet"
length (ChickWeight)
## [1] 4
nrow (ChickWeight)
## [1] 578
When we want R to tell us more information about the landscape of our categorical and quantitative variables, we use summary(dataFrameName).
summary(ChickWeight)
## weight Time Chick Diet
## Min. : 35.0 Min. : 0.00 13 : 12 1:220
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
## Median :103.0 Median :10.00 20 : 12 3:120
## Mean :121.8 Mean :10.72 10 : 12 4:118
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
## Max. :373.0 Max. :21.00 19 : 12
## (Other):506
For factors (aka, categorical variables), R outputs the number of observations in each category.
For quantitative variables, it gives you the minimum, mean, and maximum, along with the 25th, 50th (the median), and 75th percentiles. Those are the values on the variable below which 25% (or 50% or 75%) of the observations fall. For character strings, this function gives you the number of valid (i.e., not missing) observations on that variable.
For all of these classes, the output also gives the number of observations that are missing, although the ChickWeight dataset doesn't have any missing values.
summary(ChickWeight$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.0 63.0 103.0 121.8 163.8 373.0
summary(ChickWeight$Diet)
## 1 2 3 4
## 220 120 120 118
summary(as.factor(ChickWeight$Diet))
## 1 2 3 4
## 220 120 120 118
Alternatively, if you're using the Caterpillar data, please try the function on the variable host2:
summary(Caterpillar$host2)
summary(as.factor(Caterpillar$host2))
Note also that here we only temporarily treated Diet as a factor, within the single command summary(as.factor(ChickWeight$Diet)). To make the conversion permanent, assign the result back:
ChickWeight$Diet <- as.factor(ChickWeight$Diet)
Here, we are telling R to take what is on the RHS (right-hand-side) and store it in the variable named on the LHS (left-hand-side). In this case, it is OVERWRITING what was originally there.
This is a situation that you would usually want to avoid when re-coding variables. (But that’s also why you have a record of your code and a back-up copy of the original dataframe.)
Now you might want to run a summary of your data set and see how this overwriting has changed the values of your variables.
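As a quick check, a small sketch (note: in the built-in ChickWeight data, Diet is already a factor, so the conversion below is a no-op there; the pattern is what matters for your own data):

```r
data(ChickWeight)                                # built-in dataset
ChickWeight$Diet <- as.factor(ChickWeight$Diet)  # overwrite with the factor version
class(ChickWeight$Diet)                          # "factor"
summary(ChickWeight$Diet)                        # counts per level: 220 120 120 118
```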
Here, the goal is to reduce the number of columns that are in your dataframe. Because you have a record of all your code, you can always re-run all the following steps later if/WHEN you decide that you need to have more variables.
var.keep <- c("VAR1", "VAR2", "VAR3")
title_of_new_data_set <- new.data[,var.keep]
[, var.keep] tells R that your subset keeps all the rows of the selected columns. You should be able to see your subset with summary(title_of_new_data_set)
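A concrete version of the two lines above, using ChickWeight (the column names come from the built-in dataset; chickSub is a made-up name):

```r
data(ChickWeight)
var.keep <- c("weight", "Diet")       # columns to keep
chickSub <- ChickWeight[, var.keep]   # all rows, only those columns
names(chickSub)                       # "weight" "Diet"
nrow(chickSub)                        # 578 -- same number of rows as before
```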
Big picture: Many statistics programs require dummy coding (in which, for example, Diet would become 4 separate variables: 1, 2, 3, and 4, and each of those variables would be coded as 0 or 1, with a 1 indicating that an observation was in that particular category.)
R is able to do that invisibly behind the scenes. Thus in R, you would use ONE variable, Diet. It would be a factor with 4 different values (1, 2, 3, 4). In R, you can actually use strings in quotation marks as the values of a factor. This makes the analyses and data management much more intuitive, but it does add some extra steps.
levels (ChickWeight$Diet)
## [1] "1" "2" "3" "4"
The new levels must be in the same order as the previously existing levels.
Use the assignment operator to assign a new vector of strings (your new labels) to levels(VAR1).
levels(ChickWeight$Diet) = c("W", "X", "Y", "Z")
Then, running summary(ChickWeight$Diet), we will see:
summary(ChickWeight$Diet)
## W X Y Z
## 220 120 120 118
Now the labels W, X, Y, and Z have replaced "1", "2", "3", "4" as the values of Diet. This would make more intuitive sense if W, X, Y, and Z were the names of diets for chicks.
title_of_data_set <- title_of_data_set[order(title_of_data_set$unique_id,decreasing=F),]
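A concrete version of the sorting line above (chickSorted is a made-up name; ChickWeight has no unique_id column, so this sketch sorts by weight instead):

```r
data(ChickWeight)
# order() returns the row positions that put weight in ascending order
chickSorted <- ChickWeight[order(ChickWeight$weight, decreasing = FALSE), ]
head(chickSorted$weight, 3)   # the three smallest weights come first
```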
Basic takeaway from lecture and online resource:
ChickWeight$weight refers to this one variable in the dataframe
ChickWeight[, 1] also refers to this one variable (weight) in the dataframe
ChickWeight[2, 3] refers to the single value in the 2nd row, 3rd column (Chick)
ChickWeight$weight[2] gives the 2nd value of the variable weight in the dataframe ChickWeight
ChickWeight[2, ] refers to the 2nd row in the dataframe, all variables
ChickWeight[2:3, 1:4] gives you the 2nd and 3rd rows in the data frame, with the 1st, 2nd, 3rd, and 4th variables for each
It is more helpful to use variable names because those won’t change when you add or remove columns and when you re-sort the data.
ChickWeight[c(2,4), c("Time", "Diet")] gives you Time and Diet of the 2nd and 4th rows in the dataframe
If you’re using the Caterpillar data,
myCat[myCat$ID == "C611-114", ] gives you all variables for the ID C611-114
When working with indexing, you have to remember that you need to specify which rows you want and which columns you want - those are two separate things. Figuring out how to specify the rows that you want and how to specify the columns that you want is dealing with the logic of what you want from your data. That’s the important [and tricky] part here.
install.packages("descr")
install.packages("Hmisc")
library(descr)
freq(as.ordered(title_of_data_set$VAR1))
Use head() to display only the first 10 lines of the table generated for ChickWeight$weight:
library(Hmisc)
head(freq(ChickWeight$weight), n=10L)
## Frequency Percent
## 35 1 0.1730104
## 39 8 1.3840830
## 40 5 0.8650519
## 41 20 3.4602076
## 42 15 2.5951557
## 43 4 0.6920415
## 44 1 0.1730104
## 45 1 0.1730104
## 46 2 0.3460208
## 47 1 0.1730104
freq(as.ordered(ChickWeight$Time))
## as.ordered(ChickWeight$Time)
## Frequency Percent Cum Percent
## 0 50 8.651 8.651
## 2 50 8.651 17.301
## 4 49 8.478 25.779
## 6 49 8.478 34.256
## 8 49 8.478 42.734
## 10 49 8.478 51.211
## 12 49 8.478 59.689
## 14 48 8.304 67.993
## 16 47 8.131 76.125
## 18 47 8.131 84.256
## 20 46 7.958 92.215
## 21 45 7.785 100.000
## Total 578 100.000
If cumulative percentages don't make sense for your variable, don't use as.ordered().
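When you skip as.ordered(), you lose the cumulative column. If the descr package isn't available, a base-R sketch produces the same frequency-and-percent information:

```r
data(ChickWeight)
tab <- table(ChickWeight$Time)           # frequency of each time point
pct <- round(100 * prop.table(tab), 3)   # percentage of each time point
cbind(Frequency = tab, Percent = pct)    # two-column table, no cumulative column
```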
freq(as.ordered(ChickWeight$Time), xlab = "Time", ylab = "Frequency")
## as.ordered(ChickWeight$Time)
## Frequency Percent Cum Percent
## 0 50 8.651 8.651
## 2 50 8.651 17.301
## 4 49 8.478 25.779
## 6 49 8.478 34.256
## 8 49 8.478 42.734
## 10 49 8.478 51.211
## 12 49 8.478 59.689
## 14 48 8.304 67.993
## 16 47 8.131 76.125
## 18 47 8.131 84.256
## 20 46 7.958 92.215
## 21 45 7.785 100.000
## Total 578 100.000
install.packages("plyr")
library (plyr)
dataFrameName <- rename (dataFrameName, c("OldColName" = "newColName"))
ChickWeight <- rename (ChickWeight, c("weight" = "mass"))
Please refer to syntax 4 in Section 1 for a general idea of vectors and factors. A fuller set of operations:
levels(as.factor(ChickWeight$Diet))
## [1] "W" "X" "Y" "Z"
summary(as.factor(ChickWeight$Diet))
## W X Y Z
## 220 120 120 118
ChickWeight$Diet <- factor(ChickWeight$Diet, levels = c("W","X","Y","Z"), labels = c("light","medium","heavy","super"))
The levels you put here are the levels that are currently there. The labels are whatever you want to call them.
levels(ChickWeight$Diet)
## [1] "light" "medium" "heavy" "super"
summary(ChickWeight$Diet)
## light medium heavy super
## 220 120 120 118
You should see the same distribution across the values as there was before making the conversion.
class(dataFrame$VAR1) shows you whether the variable starts as a factor. If it doesn't (and if you don't need to work with levels or labels), use dataFrame$VAR1 <- as.factor(dataFrame$VAR1). For example:
class(ChickWeight$Diet)
## [1] "factor"
summary() will be able to demonstrate that fact:
summary(dataFrame$VAR1)
dataFrame$VAR1[dataFrame$VAR1 == ""] <- NA
dataFrame$VAR1[dataFrame$VAR1 == 99] <- NA
dataFrame$VAR1 <- droplevels(dataFrame$VAR1)
levels() does not show NAs, while summary() does show how many NA values there are in your variable:
levels(dataFrame$VAR1)
summary(dataFrame$VAR1)
What do I do if a variable is missing because the question was not asked? (e.g., "How many packs do you smoke a week?" is not asked if the person said they have never smoked)
Two possible decisions:
dataframeName$varName[is.na(dataframeName$varName)] <- "0"
Leave it as NA. When you report your analyses, say that these are looking at participants who smoked.
Simply leave the values as NA and R automatically excludes them from most analyses
dataframeName$Avg_varName <- (dataframeName$var1+dataframeName$var2+dataframeName$var3)/ 3
For dataframeName$Avg_varName, you can assign any name to replace Avg_varName
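An equivalent sketch using rowMeans(), where na.rm = TRUE averages over whatever items are not missing (the data frame d2 and the variable names here are made up for illustration):

```r
# Hypothetical data frame with three items and one missing value
d2 <- data.frame(var1 = c(1, 2), var2 = c(3, NA), var3 = c(5, 4))
# The plain (var1+var2+var3)/3 formula would give NA for row 2;
# rowMeans(..., na.rm = TRUE) averages the non-missing items instead.
d2$Avg_varName <- rowMeans(d2[, c("var1", "var2", "var3")], na.rm = TRUE)
d2$Avg_varName   # 3 for row 1; (2 + 4)/2 = 3 for row 2
```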
| Operation | R codes |
|---|---|
| Equal to | == |
| Larger than or equal to | >= |
| Less than or equal to | <= |
| Strictly greater than | > |
| Strictly less than | < |
| Not equal to | != |
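These comparison operators combine with | (OR) and & (AND) to select rows; a small sketch using ChickWeight (heavyLate is a made-up name):

```r
data(ChickWeight)
# Rows where Time is at least 20 AND weight exceeds 300
heavyLate <- ChickWeight[ChickWeight$Time >= 20 & ChickWeight$weight > 300, ]
summary(heavyLate$weight)   # all remaining weights are above 300
```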
options(digits = N) specifies that we only want N digits displayed before any printing command.
New_VarName <- rep(NA, number_of_observations)
Here, we are creating a new variable named New_VarName, and we assign the value NA to each observation. Since we are collapsing an existing variable in our dataset, we want the new variable to have exactly the same number of observations as the old one.
New_VarName[title_of_data_set$Old_VAR == 1 | title_of_data_set$Old_VAR == 2 | title_of_data_set$Old_VAR == 3 | title_of_data_set$Old_VAR == 5 | title_of_data_set$Old_VAR == 6] <- 1
New_VarName[title_of_data_set$Old_VAR == 4 | title_of_data_set$Old_VAR == 7 | title_of_data_set$Old_VAR == 8 | title_of_data_set$Old_VAR == 9] <- 2
We have successfully collapsed the 9 categories in Old_VAR (assuming we have such a variable) into 2 categories. Specifically, we have combined categories 1, 2, 3, 5, and 6 of Old_VAR into category 1 of New_VarName, and categories 4, 7, 8, and 9 into category 2.
Note that | means the logical connector "OR" here, and & means "AND".
To collapse mass into 5 categories by specifying thresholds, we can write:
Mass_cat <- rep(NA, 578)
Mass_cat [ChickWeight$mass < 60] <- 1
Mass_cat [ChickWeight$mass >= 60 & ChickWeight$mass < 120 ] <- 2
Mass_cat [ChickWeight$mass >= 120 & ChickWeight$mass < 180 ] <- 3
Mass_cat [ChickWeight$mass >= 180 & ChickWeight$mass < 240 ] <- 4
Mass_cat [ChickWeight$mass >= 240 ] <- 5
summary(as.factor(Mass_cat))
## 1 2 3 4 5
## 127 199 135 76 41
Remember to convert the newly created variable into a categorical variable with as.factor()
ChickWeight$Mass_cat <- as.factor(Mass_cat)
summary(ChickWeight)
## mass Time Chick Diet Mass_cat
## Min. : 35.0 Min. : 0.00 13 : 12 light :220 1:127
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 medium:120 2:199
## Median :103.0 Median :10.00 20 : 12 heavy :120 3:135
## Mean :121.8 Mean :10.72 10 : 12 super :118 4: 76
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12 5: 41
## Max. :373.0 Max. :21.00 19 : 12
## (Other):506
install.packages("ggplot2")
library (descr)
library (ggplot2)
options (digits = 2)
Before we get to the graphing code, we'll need to create some sample data to work with:
ID <- 1:12
year <- c("junior", "senior", "senior", "junior", "junior", "senior",
"junior", "senior", "senior", "junior", "junior", "senior")
taskInterest <- c(4,4,1,1,2,5,4,3,3,5,5,NA)
topicInterest <- c(5,5,2,2,3,2,2,1,4,4,4,4)
taskInterestChange <- c("stay the same", "decrease", "increase", "increase", "increase",
"decrease", "stay the same", "stay the same", "decrease",
"increase", "increase", NA)
d <- data.frame (ID, year, taskInterest, topicInterest, taskInterestChange)
This sample data set has 12 subjects (ID column has their unique identifiers).
They are in either their junior or senior year of school.
They rated their interest level on both a task and the task’s topic area.
Their interest ratings are going to be treated quantitatively here, with higher scores indicating higher levels of interest.
They indicated if they thought their interest level in the task would decrease, increase, or stay the same. This is a categorical factor with 3 levels.
summary(d)
## ID year taskInterest topicInterest
## Min. : 1.0 junior:6 Min. :1.0 Min. :1.0
## 1st Qu.: 3.8 senior:6 1st Qu.:2.5 1st Qu.:2.0
## Median : 6.5 Median :4.0 Median :3.5
## Mean : 6.5 Mean :3.4 Mean :3.2
## 3rd Qu.: 9.2 3rd Qu.:4.5 3rd Qu.:4.0
## Max. :12.0 Max. :5.0 Max. :5.0
## NA's :1
## taskInterestChange
## decrease :3
## increase :5
## stay the same:3
## NA's :1
##
##
##
ggplot(d, aes(x=taskInterest)) +
geom_bar(stat="bin", binwidth = 1)
d is the name of the dataset we just constructed. aes() specifies the axes in our output. geom_bar() tells R to generate rectangles with bases on the x-axis, while binwidth sets the width of each bar.
To understand the commands in detail and see more associated arguments, please click on "ggplot2" in the Packages tab and find the command you want.
Put your DV on the y-axis and your IV on the x-axis. The third variable can be named under fill (other options exist).
ggplot(d, aes(x=taskInterest, y=topicInterest)) +
geom_point(shape=1, size = 4)
## Warning: Removed 1 rows containing missing values (geom_point).
ggplot(d, aes(x=taskInterestChange, y=taskInterest)) +
stat_summary(fun.y="mean", geom="bar")
## Warning: Removed 1 rows containing missing values (stat_summary).
In stat_summary() we used fun.y="mean" to tell ggplot2 to take the means of our dependent variable. The geom argument specifies the geometric object that displays the data.
ggplot(d, aes(x=taskInterestChange, y=taskInterest)) +
facet_wrap(~ year) +
stat_summary(fun.y="mean", geom="bar")
## Warning: Removed 1 rows containing missing values (stat_summary).
Different distribution by d$year.
ggplot(d, aes(x=taskInterestChange, y=taskInterest, fill = year)) +
stat_summary( fun.y="mean", geom="bar", position = "dodge")
## Warning: Removed 1 rows containing missing values (stat_summary).
Different distribution by d$year but the two panels are merged into one with two colors assigned to differentiate the juniors from the seniors.
-This section is only to show you some possible ways to play with ggplot2:
ggplot(na.omit(d[,c("taskInterestChange", "taskInterest", "year")]),
aes(x=taskInterestChange, y=taskInterest, fill = year)) +
stat_summary( fun.y="mean", geom="bar", position = "dodge") +
labs(x = "Prediction of Change in Interest", y = "Mean Task Interest") +
ggtitle ("Interest Ratings by Year and Predicted Change") +
theme_classic () + scale_fill_manual(values=c("red", "green"))
Here, na.omit() tells ggplot2 to plot without missing data. We have chosen here, by indexing, to plot only three variables, namely "taskInterestChange", "taskInterest", and "year". position = "dodge" adjusts position by dodging overlaps to the side.
theme_classic(): a classic-looking theme, with x and y axis lines and no gridlines. We can also choose theme_light(), theme_minimal(), etc. Please see Help for a full list of choices.
To create your own discrete scale, use scale_fill_manual(values = c("color_for_level_1", "color_for_level_2")), with one color per level of the fill variable. Similar arguments are available in scale_size_manual(), scale_shape_manual(), etc.
Basic R graphs, using base functions and the descr package.
freq() produces a bar graph; it will run on both quantitative and categorical variables but is most appropriate for categorical variables. From the package descr:
freq(d$taskInterestChange)
## d$taskInterestChange
## Frequency Percent Valid Percent
## decrease 3 25.000 27.27
## increase 5 41.667 45.45
## stay the same 3 25.000 27.27
## NA's 1 8.333
## Total 12 100.000 100.00
hist() produces a histogram and will only run on quantitative variables; hist(d$taskInterestChange) would fail because taskInterestChange is a factor. For example:
hist(d$topicInterest)
plot (taskInterest ~ topicInterest, data = d)
Putting taskInterest before the tilde tells R to treat it as the DV and put it on the y-axis. Note how basic (and unlabeled!) the graph is. Because we name the dataset as an argument to the function, you don't need the $ notation.
Here, your y-axis is the proportion of responses on your DV that fall into a specific level of the DV for that level of the IV. LOOK AT THE OUTPUT TO PARSE THIS.
My example treats the different levels of taskInterest as categorical here even though it could also be treated as a quantitative variable. With this syntax, R makes it categorical.
I'm breaking the command down into steps.
- TIP: By breaking down code into steps like this, it is easier to debug.
- It is also easier to update the code when it is in steps rather than one long, complicated line.
interestChangeTable <- table(d$taskInterest, d$taskInterestChange)
interestChangeTable
##
## decrease increase stay the same
## 1 0 2 0
## 2 0 1 0
## 3 1 0 1
## 4 1 0 2
## 5 1 2 0
It shows the values the variable takes even when it is a quantitative variable. By assigning the TABLE OUTPUT to a variable, we can use it without repeating the syntax to create the table.
From this example, we see that 2 students with a rating of 1 said they would increase.
prop.table() converts the counts into proportions, controlled by the margin argument. Check that this is what you wanted.
interestChangeTableProp2 <- prop.table(interestChangeTable, margin = 2)
interestChangeTableProp2
Here, it shows us that 67% of people who said they would stay the same had a rating of 4.
The margin you select depends on the order you originally used in your initial table command and what your question is. Put the DV first in the table command (putting its values in the rows) and then use the column (2) percentages.
barplot(interestChangeTableProp2)
This shows the column percentages in barplots.
barplot(prop.table(table(d$taskInterest, d$taskInterestChange), 2))
Choose the one that works better for your assignment/research.
grpMeans <- by(d$topicInterest, d$taskInterestChange, mean, na.rm = TRUE)
grpMeans <- tapply (d$topicInterest, d$taskInterestChange, mean, na.rm = TRUE)
Either output can go directly into the barplot command (or could nest the rhs syntax)
What this command says: apply the function mean to the DV topicInterest for EACH of the levels in the IV taskInterestChange
barplot(grpMeans)
Here, the example actually shows that people who said their interest would decrease were the ones who had the highest mean interest rating in the task.
The by() syntax within the function ftable() is what makes the output readable:
grpMeansNew <- ftable(by(d$taskInterest, list(d$year, d$taskInterestChange), mean))
barplot(ftable(by(d$taskInterest, list(d$year, d$taskInterestChange), mean)), beside = TRUE)
Please note that argument beside is important because it tells R to line up the bars side by side instead of stacking them up.
These are frequency distributions.
These work for a categorical variable or a quantitative variable.
ggplot(d, aes(x=taskInterest)) +
geom_bar(stat="bin", binwidth = 1)
ggplot(d, aes(x=as.factor(taskInterest))) +
geom_bar(stat="bin", binwidth = 1)
na.omit(dataframeName) drops observations with missing values before plotting:
ggplot(na.omit(d), aes(x=as.factor(taskInterest))) +
geom_bar(stat="bin", binwidth = 1)
ggplot(na.omit(d), aes(x=as.factor(taskInterest))) +
geom_bar(stat="bin", binwidth = 1) +
labs(x = "Level of Task Interest (5 = extremely interested)", y = "Frequency") +
ggtitle ("Frequency Distribution of Task Interest") +
theme_bw()
ggplot(na.omit(d), aes(x=as.factor(taskInterest))) +
geom_bar(stat="bin", binwidth = 1, fill = "navy") +
labs(x = "Level of Task Interest (5 = extremely interested)", y = "Frequency") +
ggtitle ("Frequency Distribution of Task Interest") +
theme_bw()
Use the fill = "color" argument in the function geom_bar() to change color.
The breaks = seq() argument overrides the binwidth argument:
ggplot(na.omit(d), aes(x=taskInterest)) +
geom_bar(stat="bin", breaks = seq(.5, 5.5, 1), fill = "grey") +
labs(x = "Level of Task Interest (5 = extremely interested)", y = "Frequency") +
ggtitle ("Frequency Distribution of Task Interest") +
theme_bw()
Remember: Put your DV (response var) on the y-axis and put your IV(explanatory var) on the x-axis.
ggplot(d, aes(x=taskInterest, y=topicInterest)) +
geom_point(shape=1, size = 4)
geom_point() is used to create a scatterplot.
? shape
Run the example code at the bottom of the help file. (count L->R from bottom)
ggplot(d, aes(x=taskInterest, y=topicInterest)) +
geom_point(shape=16, size = 4, color = "red") +
labs(x = "Level of Task Interest (5 = high)", y = "Level of Topic Interest (5 = high)") +
ggtitle ("Relationship between Task Interest and Topic Interest") +
theme_bw()
ggplot(d, aes(x=taskInterest, y=topicInterest)) +
geom_point(shape=16, size = 4, color = "red") +
stat_smooth(method="lm", se=FALSE) +
labs(x = "Level of Task Interest (5 = high)", y = "Level of Topic Interest (5 = high)") +
ggtitle ("Relationship between Task Interest and Topic Interest") +
theme_bw()
stat_smooth(method = "lm") tells ggplot2 to draw a best fit line (linear regression) to aid the eye in seeing patterns in the presence of overplotting.
ggplot(d, aes(x=jitter(taskInterest), y=topicInterest)) +
geom_point(shape=16, size = 4, color = "red") +
stat_smooth(method="lm", se=FALSE) +
labs(x = "Level of Task Interest (5 = high)", y = "Level of Topic Interest (5 = high)") +
ggtitle ("Relationship between Task Interest and Topic Interest") +
theme_bw()
x=jitter(taskInterest) doesn’t change anything in the data. It merely jitters the points of the same value so that we can see the points better.
Jittering is more appropriate in scatterplots than in other graphs.
ggplot(d, aes(x=jitter(taskInterest,.9), y=jitter(topicInterest,.2))) +
geom_point(shape=16, size = 4, color = "red") +
stat_smooth(method="lm", se=FALSE) +
labs(x = "Level of Task Interest (5 = high)", y = "Level of Topic Interest (5 = high)") +
ggtitle ("Relationship between Task Interest and Topic Interest") +
theme_bw()
Plotting means of a quantitative DV with a categorical IV:
ggplot(d, aes(x=taskInterestChange, y=taskInterest)) +
stat_summary(fun.y="mean", geom="bar")
a little bit of clean up:
ggplot(na.omit(d), aes(x=taskInterestChange, y=taskInterest)) +
stat_summary(fun.y="mean", geom="bar", color = "red", fill = "pink")
We deal with IV and DV that have more than 2 levels. In fact, the IV can be quantitative or categorical here, but if quantitative, bin it.
stacked bar chart - COUNTS within each group:
ggplot(na.omit(d), aes(x=year, fill=as.factor(taskInterest))) +
geom_bar(stat = "bin", position = "stack") +
labs(x = "Year", y = "Frequency") +
ggtitle ("Relationship between Task Interest and Year") +
theme_bw()
grouped bar chart:
ggplot(na.omit(d), aes(x=year, fill=as.factor(taskInterest))) +
geom_bar(stat = "bin", position = "dodge") +
labs(x = "Year", y = "Frequency") +
ggtitle ("Relationship between Task Interest and Year") +
theme_bw()
stacked bar chart - PROPORTIONS within each group (with groups that have different numbers of observations, USE PROPORTIONS)
ggplot(na.omit(d), aes(x=year, fill=as.factor(taskInterest))) +
geom_bar(stat = "bin", position = "fill") +
labs(x = "Year", y = "Proportion of Group") +
ggtitle ("Relationship between Task Interest and Year") +
theme_bw()
ANOTHER WAY TO PLOT distributions when BOTH variables are categorical:
ggplot(na.omit(d), aes(x = taskInterestChange)) +
facet_wrap (~ year) +
geom_bar(stat = "bin") +
labs(x = "Task Interest Change", y = "Frequency") +
ggtitle ("Relationship between Task Interest and Year") +
theme_bw()
Example: quantitative variable on x-axis and categorical variable (again, not a useful graph with this limited dataset):
ggplot(na.omit(d), aes(x = taskInterest)) +
facet_wrap (~ year) +
geom_bar(stat = "bin", breaks = seq(.5, 5.5, 1)) +
labs(x = "Task Interest", y = "Frequency") +
ggtitle ("Relationship between Task Interest and Year") +
theme_bw()
We make the categorical DV into two categories.
Use 0 = no and 1 = yes, since this coding lets the mean of the variable EQUAL the proportion of 'yes' responses in the variable.
line graph of taskInterestChange = increase across taskInterest (quantitative)
d$interestIncreased <- NA
d$interestIncreased[d$taskInterestChange == "increase"] <- 1
d$interestIncreased[d$taskInterestChange == "decrease" |
d$taskInterestChange == "stay the same"] <- 0
ggplot(na.omit(d), aes(x=taskInterest, y = interestIncreased)) +
stat_summary(fun.y="mean", geom="point", color = "red", size = 4) +
geom_line ()
Comment: Here, the plotted point is the percentage of observations in that level of taskInterest who said that their interest would increase.
Using facet or assigning a variable to fill or to color works for scatterplots, line graphs, bar graphs, etc.
You can also assign the 3rd variable to shape or size of points.
# multivariate plotting in THE SAME GRAPH:
ggplot(na.omit(d), aes(x=year, y=taskInterest, fill = taskInterestChange)) +
stat_summary( fun.y="mean", geom="bar", position = "dodge")
# multivariate plotting in DIFFERENT PANELS:
ggplot(na.omit(d), aes(x=taskInterestChange, y=taskInterest)) +
facet_wrap(~ year) +
stat_summary(fun.y="mean", geom="bar")
ggplot(na.omit(d), aes(x=topicInterest, y=taskInterest)) +
facet_wrap(~ year + taskInterestChange) +
stat_summary(fun.y="mean", geom="bar")
ggplot(na.omit(d), aes(x=taskInterestChange, y=taskInterest)) +
stat_summary(fun.y="mean", geom="bar", color = "red", fill = "pink")
In this graph we set the order of the levels of our factor first, using factor() with a levels = c() argument:
d$taskInterestChange <- factor(d$taskInterestChange,
levels = c("decrease", "stay the same", "increase"))
ggplot(na.omit(d), aes(x=taskInterestChange, y=taskInterest)) +
stat_summary(fun.y="mean", geom="bar", color = "red", fill = "pink")
coord_flip() would switch which variable is on which axis
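For instance, adding coord_flip() to the bar chart above turns the bars horizontal (a sketch, using the sample data frame d built earlier):

```r
library(ggplot2)
# Same means-by-group bar chart as above, but with the axes swapped
ggplot(na.omit(d), aes(x = taskInterestChange, y = taskInterest)) +
  stat_summary(fun.y = "mean", geom = "bar", color = "red", fill = "pink") +
  coord_flip()   # bars now run horizontally
```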
More formatting options:
ggplot(na.omit(d),
aes(x=taskInterestChange, y=taskInterest, fill = year)) +
stat_summary( fun.y="mean", geom="bar", position = "dodge") +
labs(x = "Prediction of Change in Interest", y = "Mean Task Interest") +
ggtitle ("Interest Ratings by Year and Predicted Change") +
theme_classic () + scale_fill_manual(values=c("red", "green"))
If not all of your labels show up in the plot window, make the window bigger before you click Export and copy the graph to the clipboard.
When you click export, and choose to copy the graph, you can change the size of the graph. Select to maintain the ratio, and then use the lower right-hand shaded triangle to make the plotted area smaller. This will make the fonts relatively larger.
We use the built-in dataset InsectSprays. It is very good for demonstrating how to use ANOVA to approach your research questions.
? InsectSprays
Research question: Does the type of insect spray used affect the count of insects?
############ R set-up
options (digits = 2)
library (ggplot2)
############ get the data
data (InsectSprays)
summary(InsectSprays)
## count spray
## Min. : 0.0 A:12
## 1st Qu.: 3.0 B:12
## Median : 7.0 C:12
## Mean : 9.5 D:12
## 3rd Qu.:14.2 E:12
## Max. :26.0 F:12
tapply (InsectSprays$count, InsectSprays$spray, mean)
## A B C D E F
## 14.5 15.3 2.1 4.9 3.5 16.7
tapply (InsectSprays$count, InsectSprays$spray, sd)
## A B C D E F
## 4.7 4.3 2.0 2.5 1.7 6.2
tapply (InsectSprays$count, InsectSprays$spray, length)
## A B C D E F
## 12 12 12 12 12 12
We can see that the data frame has 72 observations of 2 variables, one quantitative and the other categorical.
ggplot (InsectSprays[!is.na(InsectSprays$spray) & !is.na(InsectSprays$count),],
aes(x = spray, y = count)) +
geom_boxplot() +
stat_summary(fun.y=mean, colour="darkred", geom="point",
shape=18, size=4) +
labs(x = "Spray", y = "Average # of Insects") +
ggtitle ("Insect Counts with Different Sprays") +
theme_bw()
ggplot (InsectSprays[!is.na(InsectSprays$spray) & !is.na(InsectSprays$count),]
, aes(x = spray, y = count)) +
stat_summary(fun.y="mean", geom="bar", fill = "lightblue") +
stat_summary(fun.y = mean,
fun.ymin = function(x) mean(x) - sd(x),
fun.ymax = function(x) mean(x) + sd(x),
geom = "pointrange",
color = "blue") +
labs(x = "Spray", y = "Average # of Insects (+/- 1 SD") +
ggtitle ("Insect Counts with Different Sprays") +
theme_bw()
Please pay attention to the new arguments in our graphing command and look up their possible values in Help.
bartlett.test(Quantitative_DV ~ Categorical_IV, data=your_data_frame)
bartlett.test(count ~ spray, data=InsectSprays)
##
## Bartlett test of homogeneity of variances
##
## data: count by spray
## Bartlett's K-squared = 26, df = 5, p-value = 9.085e-05
aov.out <- aov(count ~ spray, data=InsectSprays)
summary(aov.out)
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 2669 534 34.7 <2e-16 ***
## Residuals 66 1015 15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov.out is the name we assigned to our ANOVA here, but you are free to choose!
aov.name <- aov(Quantitative_DV ~ Categorical_IV, data=your_data_frame)
TukeyHSD (aov.out)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = count ~ spray, data = InsectSprays)
##
## $spray
## diff lwr upr p adj
## B-A 0.83 -3.9 5.5 1.00
## C-A -12.42 -17.1 -7.7 0.00
## D-A -9.58 -14.3 -4.9 0.00
## E-A -11.00 -15.7 -6.3 0.00
## F-A 2.17 -2.5 6.9 0.75
## C-B -13.25 -17.9 -8.6 0.00
## D-B -10.42 -15.1 -5.7 0.00
## E-B -11.83 -16.5 -7.1 0.00
## F-B 1.33 -3.4 6.0 0.96
## D-C 2.83 -1.9 7.5 0.49
## E-C 1.42 -3.3 6.1 0.95
## F-C 14.58 9.9 19.3 0.00
## E-D -1.42 -6.1 3.3 0.95
## F-D 11.75 7.1 16.4 0.00
## F-E 13.17 8.5 17.9 0.00
InsectSprays$thirdVar <- sample(1:2,72, replace = TRUE)
InsectSprays$thirdVar <- factor(InsectSprays$thirdVar, levels = c(1,2), labels = c("Group A", "Group B"))
summary(InsectSprays)
## count spray thirdVar
## Min. : 0.0 A:12 Group A:36
## 1st Qu.: 3.0 B:12 Group B:36
## Median : 7.0 C:12
## Mean : 9.5 D:12
## 3rd Qu.:14.2 E:12
## Max. :26.0 F:12
Caveat: YOUR ACTUAL VALUES WILL DIFFER BECAUSE IT IS A RANDOM SAMPLE!!
aovA.out <- aov(count ~ spray, data=InsectSprays[InsectSprays$thirdVar == "Group A",])
aovB.out <- aov(count ~ spray, data=InsectSprays[InsectSprays$thirdVar == "Group B",])
summary(aovA.out)
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 1244 248.9 12.2 1.6e-06 ***
## Residuals 30 610 20.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aovB.out)
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 1401 280.2 24.8 8.2e-10 ***
## Residuals 30 340 11.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
by() approach. The big picture here is to see whether different levels of the third variable(s) result in different associations between (the means of) your quantitative DV and (the levels of) your categorical IV.
General syntax:
by(title_of_dataset, title_of_dataset$ThirdVariable, function(x)
list(aov(QuantitativeResponse ~ CategoricalExplanatory, data=x),
summary(aov(QuantitativeResponse ~ CategoricalExplanatory, data=x))))
For example:
by(InsectSprays, InsectSprays$thirdVar, function(x)
summary(aov(count ~ spray, data = x)))
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 1244 248.9 12.2 1.6e-06 ***
## Residuals 30 610 20.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------------------
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 1401 280.2 24.8 8.2e-10 ***
## Residuals 30 340 11.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To get results from TukeyHSD post hoc tests for each level of thirdVar:
by(InsectSprays, InsectSprays$thirdVar, function(x)
list(summary(aov(count ~ spray, data = x)),
TukeyHSD(aov(count ~ spray, data = x))))
## InsectSprays$thirdVar: Group A
## [[1]]
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 1244 248.9 12.2 1.6e-06 ***
## Residuals 30 610 20.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## [[2]]
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = count ~ spray, data = x)
##
## $spray
## diff lwr upr p adj
## B-A -1.6 -9.0 5.8 0.98
## C-A -12.6 -20.4 -4.8 0.00
## D-A -11.4 -19.2 -3.6 0.00
## E-A -11.6 -19.4 -3.8 0.00
## F-A 1.1 -6.0 8.2 1.00
## C-B -11.0 -19.3 -2.7 0.00
## D-B -9.8 -18.1 -1.5 0.01
## E-B -10.0 -18.3 -1.7 0.01
## F-B 2.7 -4.9 10.3 0.88
## D-C 1.2 -7.5 9.9 1.00
## E-C 1.0 -7.7 9.7 1.00
## F-C 13.7 5.7 21.7 0.00
## E-D -0.2 -8.9 8.5 1.00
## F-D 12.5 4.5 20.5 0.00
## F-E 12.7 4.7 20.7 0.00
##
##
## --------------------------------------------------------
## InsectSprays$thirdVar: Group B
## [[1]]
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 1401 280.2 24.8 8.2e-10 ***
## Residuals 30 340 11.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## [[2]]
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = count ~ spray, data = x)
##
## $spray
## diff lwr upr p adj
## B-A 4.417 -2.2 11.02 0.35
## C-A -10.821 -17.2 -4.41 0.00
## D-A -6.821 -13.2 -0.41 0.03
## E-A -9.107 -15.5 -2.69 0.00
## F-A 4.350 -2.5 11.21 0.41
## C-B -15.238 -20.9 -9.54 0.00
## D-B -11.238 -16.9 -5.54 0.00
## E-B -13.524 -19.2 -7.83 0.00
## F-B -0.067 -6.3 6.13 1.00
## D-C 4.000 -1.5 9.47 0.26
## E-C 1.714 -3.8 7.18 0.93
## F-C 15.171 9.2 21.16 0.00
## E-D -2.286 -7.8 3.18 0.80
## F-D 11.171 5.2 17.16 0.00
## F-E 13.457 7.5 19.45 0.00
A more statistically correct approach is to test the third variable and the interaction in a single model:
aovInt.out <- aov(count ~ spray * thirdVar, data=InsectSprays)
summary(aovInt.out)
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 2669 534 33.72 3.2e-16 ***
## thirdVar 1 1 1 0.09 0.77
## spray:thirdVar 5 64 13 0.81 0.55
## Residuals 60 950 16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
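A quick way to eyeball the (non-)interaction is a base-graphics interaction plot; roughly parallel lines agree with a non-significant spray:thirdVar term (a sketch, assuming thirdVar was created as above):

```r
# Mean count per spray, traced separately for each thirdVar group
with(InsectSprays,
     interaction.plot(spray, thirdVar, count,
                      xlab = "Spray", ylab = "Mean count"))
```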
spray:thirdVar is the interaction effect: if it is significant, do follow-up tests (one-way ANOVA simple effects analyses and/or pairwise comparisons, controlling alpha in each case). The thirdVar row tests the main effect of thirdVar (that is, does thirdVar have an effect independent of spray?). Here neither the thirdVar main effect nor the interaction term is significant, while spray retains a significant main effect.
For the chi-square examples in this section we use the "survey" dataset from the MASS library. Set-up:
options (digits = 2)
library (ggplot2)
library (MASS)
library(descr)
############ get the data
data (survey)
? survey
Research Question: Is exercise related to smoking habits? (Exercise is my DV here)
Initial data management:
# telling R what order the factor levels should be in (for output)
survey$Smoke <- factor(survey$Smoke,
levels = c("Never", "Occas", "Regul", "Heavy"))
survey$Exer <- factor(survey$Exer,
levels = c("None", "Some", "Freq"))
Descriptive statistics of the categorical variables we are using here:
summary(survey)
## Sex Wr.Hnd NW.Hnd W.Hnd Fold
## Female:118 Min. :13.0 Min. :12.5 Left : 18 L on R : 99
## Male :118 1st Qu.:17.5 1st Qu.:17.5 Right:218 Neither: 18
## NA's : 1 Median :18.5 Median :18.5 NA's : 1 R on L :120
## Mean :18.7 Mean :18.6
## 3rd Qu.:19.8 3rd Qu.:19.7
## Max. :23.2 Max. :23.5
## NA's :1 NA's :1
## Pulse Clap Exer Smoke Height
## Min. : 35 Left : 39 None: 24 Never:189 Min. :150
## 1st Qu.: 66 Neither: 50 Some: 98 Occas: 19 1st Qu.:165
## Median : 72 Right :147 Freq:115 Regul: 17 Median :171
## Mean : 74 NA's : 1 Heavy: 11 Mean :172
## 3rd Qu.: 80 NA's : 1 3rd Qu.:180
## Max. :104 Max. :200
## NA's :45 NA's :28
## M.I Age
## Imperial: 68 Min. :17
## Metric :141 1st Qu.:18
## NA's : 28 Median :19
## Mean :20
## 3rd Qu.:20
## Max. :73
##
# frequency distribution (with proportions and graphs then tables)
freq(survey$Smoke)
## survey$Smoke
## Frequency Percent Valid Percent
## Never 189 79.7468 80.085
## Occas 19 8.0169 8.051
## Regul 17 7.1730 7.203
## Heavy 11 4.6414 4.661
## NA's 1 0.4219
## Total 237 100.0000 100.000
freq(survey$Exer)
## survey$Exer
## Frequency Percent
## None 24 10.13
## Some 98 41.35
## Freq 115 48.52
## Total 237 100.00
table(survey$Smoke)
##
## Never Occas Regul Heavy
## 189 19 17 11
table(survey$Exer)
##
## None Some Freq
## 24 98 115
table(survey$Smoke, survey$Exer)
##
## None Some Freq
## Never 18 84 87
## Occas 3 4 12
## Regul 1 7 9
## Heavy 1 3 7
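If you also want row and column totals on that cross-tabulation, addmargins() appends them (a sketch):

```r
# Same table as above, with a Sum row and Sum column added
addmargins(table(survey$Smoke, survey$Exer))
```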
Use ftable as the function name for better formatting:
ftable(survey$Smoke, survey$Exer, survey$Sex, survey$Fold)
## L on R Neither R on L
##
## Never None Female 4 2 4
## Male 4 0 4
## Some Female 24 2 24
## Male 8 6 20
## Freq Female 15 0 24
## Male 21 5 21
## Occas None Female 0 0 1
## Male 1 0 1
## Some Female 0 0 3
## Male 1 0 0
## Freq Female 1 0 4
## Male 4 0 3
## Regul None Female 0 0 0
## Male 0 0 1
## Some Female 2 1 0
## Male 2 0 2
## Freq Female 1 0 1
## Male 4 1 2
## Heavy None Female 0 0 0
## Male 1 0 0
## Some Female 0 0 2
## Male 1 0 0
## Freq Female 1 1 1
## Male 2 0 2
prop.table(table(survey$Smoke, survey$Exer),1)
##
## None Some Freq
## Never 0.095 0.444 0.460
## Occas 0.158 0.211 0.632
## Regul 0.059 0.412 0.529
## Heavy 0.091 0.273 0.636
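The 1 above asks for row proportions; passing 2 instead gives column proportions, i.e., the distribution of smoking within each exercise level (a sketch):

```r
# Proportions within each column (exercise level)
prop.table(table(survey$Smoke, survey$Exer), 2)
```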
ggplot(survey[!is.na(survey$Smoke) & !is.na(survey$Exer),],
aes(x=Smoke, fill=as.factor(Exer))) +
geom_bar(stat = "count", position = "stack") +
labs(x = "Smoking", y = "Frequency") +
ggtitle ("Relationship between Smoking and Exercise") +
theme_bw()
ggplot(survey[!is.na(survey$Smoke) & !is.na(survey$Exer),],
aes(x=Smoke, fill=as.factor(Exer))) +
geom_bar(stat = "count", position = "fill") +
labs(x = "Smoking", y = "Proportion") +
ggtitle ("Relationship between Smoking and Exercise") +
theme_bw()
ggplot(survey[!is.na(survey$Smoke) & !is.na(survey$Exer),],
aes(x=Exer)) +
facet_wrap (~ Smoke) +
geom_bar(stat = "count") +
labs(x = "Exercise", y = "Frequency") +
ggtitle ("Relationship between Smoking and Exercise") +
theme_bw()
Syntax:
chi.name <- chisq.test(title_of_data_set$DV, title_of_data_set$IV)
chi.out <- chisq.test(survey$Exer, survey$Smoke)
## Warning in chisq.test(survey$Exer, survey$Smoke): Chi-squared
## approximation may be incorrect
chi.out
##
## Pearson's Chi-squared test
##
## data: survey$Exer and survey$Smoke
## X-squared = 5.5, df = 6, p-value = 0.4828
summary(chi.out)
## Length Class Mode
## statistic 1 -none- numeric
## parameter 1 -none- numeric
## p.value 1 -none- numeric
## method 1 -none- character
## data.name 1 -none- character
## observed 12 table numeric
## expected 12 -none- numeric
## residuals 12 table numeric
## stdres 12 table numeric
chi.out$observed
## survey$Smoke
## survey$Exer Never Occas Regul Heavy
## None 18 3 1 1
## Some 84 4 7 3
## Freq 87 12 9 7
chi.out$expected
## survey$Smoke
## survey$Exer Never Occas Regul Heavy
## None 18 1.9 1.7 1.1
## Some 78 7.9 7.1 4.6
## Freq 92 9.3 8.3 5.4
chi.out$residuals
## survey$Smoke
## survey$Exer Never Occas Regul Heavy
## None -0.098 0.844 -0.510 -0.070
## Some 0.623 -1.385 -0.022 -0.734
## Freq -0.531 0.901 0.249 0.708
chi.out$residuals tells you which cells are significantly different from what would be expected by chance. In a 2x2 design, residuals > |1.96| indicate that a cell has an observed number of cases significantly different from what would be expected by chance. For post hoc tests, divide alpha by the number of possible comparisons (note that it is possible comparisons, not just the number you run). If you had a significant chi-square test, you would follow it up with post hoc tests.
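Rather than scanning the residual table by eye, the cutoff can be applied programmatically (a sketch; here no cell exceeds it):

```r
# Which cells have residuals beyond +/- 1.96?
which(abs(chi.out$residuals) > 1.96, arr.ind = TRUE)
```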
IF AND ONLY IF (i.e., IFF) your Chi-Square is significant, do this post hoc test.
Here, you would compare the distribution of exercise for Never vs. Occas, Never vs. Regul, Never vs. Heavy, Occas vs. Regul, Occas vs. Heavy, and Regul vs. Heavy.
To control your Type I error rate, apply the Bonferroni correction to your alpha level: divide your original alpha by the number of possible pairwise comparisons.
Here, with alpha = .05 and 6 possible comparisons, the new significance level to use is .05 / 6 = .008; the observed p-value for each pairwise comparison (based on subsetted chi-squares) must be < .008 to be significant.
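The comparison count and adjusted alpha can be computed rather than counted by hand (a sketch; with 4 smoking levels, choose(4, 2) gives 6 distinct pairs):

```r
k <- choose(nlevels(survey$Smoke), 2)  # number of possible pairwise comparisons
.05 / k                                # Bonferroni-adjusted alpha
```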
Note that we subset both variables (so they have the same number of rows) and apply the logical indexing to the rows; the column is already specified by the $ name, so nothing more is needed inside the brackets. Example:
chi.out1 <- chisq.test(survey$Exer[survey$Smoke == "Never" | survey$Smoke == "Occas"],
survey$Smoke[survey$Smoke == "Never" | survey$Smoke == "Occas"])
## Warning in chisq.test(survey$Exer[survey$Smoke == "Never" | survey$Smoke
## == : Chi-squared approximation may be incorrect
chi.out1
##
## Pearson's Chi-squared test
##
## data: survey$Exer[survey$Smoke == "Never" | survey$Smoke == "Occas"] and survey$Smoke[survey$Smoke == "Never" | survey$Smoke == "Occas"]
## X-squared = 4, df = 2, p-value = 0.1375
chi.out1$observed
##
## Never Occas
## None 18 3
## Some 84 4
## Freq 87 12
Discussion: Here, you could indicate whether or not exercise patterns differ for never smokers vs. occasional smokers. Note that because the dependent variable has three levels, we do not know exactly where the difference lies. Depending on your RQ, you may not need that.
Another example:
chi.out2 <- chisq.test(survey$Exer[survey$Smoke == "Never" | survey$Smoke == "Heavy"],
survey$Smoke[survey$Smoke == "Never" | survey$Smoke == "Heavy"])
## Warning in chisq.test(survey$Exer[survey$Smoke == "Never" | survey$Smoke
## == : Chi-squared approximation may be incorrect
chi.out2
##
## Pearson's Chi-squared test
##
## data: survey$Exer[survey$Smoke == "Never" | survey$Smoke == "Heavy"] and survey$Smoke[survey$Smoke == "Never" | survey$Smoke == "Heavy"]
## X-squared = 1.4, df = 2, p-value = 0.4985
chi.out2$observed
##
## Never Heavy
## None 18 1
## Some 84 3
## Freq 87 7
Moderation question: Is the relationship between smoking and exercise (DV) different for different sexes (3rd var)?
# both of these next lines are equivalent (just two different syntaxes)
chisq.test(survey[survey$Sex == "Male","Exer"], survey[survey$Sex == "Male", "Smoke"])
##
## Pearson's Chi-squared test
##
## data: survey[survey$Sex == "Male", "Exer"] and survey[survey$Sex == "Male", "Smoke"]
## X-squared = 4.7, df = 6, p-value = 0.5871
chisq.test(survey[survey$Sex == "Male",]$Exer, survey[survey$Sex == "Male",]$Smoke)
##
## Pearson's Chi-squared test
##
## data: survey[survey$Sex == "Male", ]$Exer and survey[survey$Sex == "Male", ]$Smoke
## X-squared = 4.7, df = 6, p-value = 0.5871
chisq.test(survey[survey$Sex == "Female",]$Exer, survey[survey$Sex == "Female",]$Smoke)
##
## Pearson's Chi-squared test
##
## data: survey[survey$Sex == "Female", ]$Exer and survey[survey$Sex == "Female", ]$Smoke
## X-squared = 2.7, df = 6, p-value = 0.8482
chisq.test(survey[survey$Sex == "Female","Exer"], survey[survey$Sex == "Female", "Smoke"])
##
## Pearson's Chi-squared test
##
## data: survey[survey$Sex == "Female", "Exer"] and survey[survey$Sex == "Female", "Smoke"]
## X-squared = 2.7, df = 6, p-value = 0.8482
by approach: the "x" stays as it is; put in your title_of_dataset and variables.
Syntax:
by(title_of_dataset, title_of_dataset$ThirdVariable, function(x)
list(chisq.test(x$Categorical_DV, x$Categorical_IV),
chisq.test(x$Categorical_DV,
x$Categorical_IV)$residuals))
For example:
by(survey, survey$Sex, function(x)
chisq.test(x$Exer, x$Smoke))
## survey$Sex: Female
##
## Pearson's Chi-squared test
##
## data: x$Exer and x$Smoke
## X-squared = 2.7, df = 6, p-value = 0.8482
##
## --------------------------------------------------------
## survey$Sex: Male
##
## Pearson's Chi-squared test
##
## data: x$Exer and x$Smoke
## X-squared = 4.7, df = 6, p-value = 0.5871
Want to run more than one function? Wrap them in list():
by(survey, survey$Sex, function(x)
list(
chisq.test(x$Exer, x$Smoke),
chisq.test(x$Exer, x$Smoke)$residuals))
## survey$Sex: Female
## [[1]]
##
## Pearson's Chi-squared test
##
## data: x$Exer and x$Smoke
## X-squared = 2.7, df = 6, p-value = 0.8482
##
##
## [[2]]
## x$Smoke
## x$Exer Never Occas Regul Heavy
## None 0.254 0.176 -0.683 -0.683
## Some 0.192 -0.677 0.346 -0.292
## Freq -0.329 0.653 -0.053 0.641
##
## --------------------------------------------------------
## survey$Sex: Male
## [[1]]
##
## Pearson's Chi-squared test
##
## data: x$Exer and x$Smoke
## X-squared = 4.7, df = 6, p-value = 0.5871
##
##
## [[2]]
## x$Smoke
## x$Exer Never Occas Regul Heavy
## None -0.373 0.962 -0.208 0.490
## Some 0.648 -1.308 -0.051 -0.734
## Freq -0.348 0.613 0.129 0.365
In this section we’ll use the “survey” dataset from the MASS library. First, we need to set up the basics and familiarize ourselves with the data.
If you have not installed the “MASS” library, use install.packages("MASS")
Set-up:
options (digits = 2)
library (ggplot2)
library (MASS)
############ get the data
data (survey)
? survey
RQ: Is a person’s height related to their pulse? (PULSE is my DV here, and both of the variables are quantitative)
Descriptive statistics:
summary(survey)
## Sex Wr.Hnd NW.Hnd W.Hnd Fold
## Female:118 Min. :13.0 Min. :12.5 Left : 18 L on R : 99
## Male :118 1st Qu.:17.5 1st Qu.:17.5 Right:218 Neither: 18
## NA's : 1 Median :18.5 Median :18.5 NA's : 1 R on L :120
## Mean :18.7 Mean :18.6
## 3rd Qu.:19.8 3rd Qu.:19.7
## Max. :23.2 Max. :23.5
## NA's :1 NA's :1
## Pulse Clap Exer Smoke Height
## Min. : 35 Left : 39 Freq:115 Heavy: 11 Min. :150
## 1st Qu.: 66 Neither: 50 None: 24 Never:189 1st Qu.:165
## Median : 72 Right :147 Some: 98 Occas: 19 Median :171
## Mean : 74 NA's : 1 Regul: 17 Mean :172
## 3rd Qu.: 80 NA's : 1 3rd Qu.:180
## Max. :104 Max. :200
## NA's :45 NA's :28
## M.I Age
## Imperial: 68 Min. :17
## Metric :141 1st Qu.:18
## NA's : 28 Median :19
## Mean :20
## 3rd Qu.:20
## Max. :73
##
summary(survey$Height)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 150 165 171 172 180 200 28
summary(survey$Pulse)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 35 66 72 74 80 104 45
ggplot(survey[!is.na(survey$Height) & !is.na(survey$Pulse),],
aes(x=Height, y=Pulse)) +
geom_point(shape=16, size = 4, color = "red") +
stat_smooth(method="lm", se=FALSE) +
labs(x = "Height (cm)", y = "Pulse (beats per min)") +
ggtitle ("Relationship between Pulse and Height") +
theme_bw()
ggplot(survey[!is.na(survey$Height) & !is.na(survey$Pulse),],
aes(x=jitter(Height, 0), y=jitter(Pulse,0))) + # turned off jitter by set to 0
geom_point(shape=16, size = 2, color = "red", alpha = .3) +
#stat_smooth(method="lm", se=FALSE) +
labs(x = "Height (cm)", y = "Pulse (beats per min)") +
ggtitle ("Relationship between Pulse and Height") +
theme_bw()
Syntax:
cor.test(title_of_data_set$DV, title_of_data_set$IV)
cor.test (survey$Height, survey$Pulse)
##
## Pearson's product-moment correlation
##
## data: survey$Height and survey$Pulse
## t = -1.1, df = 169, p-value = 0.275
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.231 0.067
## sample estimates:
## cor
## -0.084
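Squaring the correlation gives the proportion of shared variance, which puts the (non-significant) r = -.084 in context (a sketch):

```r
r <- cor(survey$Height, survey$Pulse, use = "complete.obs")
r^2   # well under 1% of the variance is shared
```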
by approach. Syntax:
by(title_of_dataset, title_of_dataset$ThirdVariable, function(x) cor.test(~ Quantitative_DV + Quantitative_IV, data=x))
Note the different syntax used within the cor.test function
For example:
by(survey, survey$Sex, function(x)
cor.test (~ Height + Pulse, data = x))
## survey$Sex: Female
##
## Pearson's product-moment correlation
##
## data: Height and Pulse
## t = -0.71, df = 83, p-value = 0.4805
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.29 0.14
## sample estimates:
## cor
## -0.078
##
## --------------------------------------------------------
## survey$Sex: Male
##
## Pearson's product-moment correlation
##
## data: Height and Pulse
## t = -0.31, df = 83, p-value = 0.7544
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.25 0.18
## sample estimates:
## cor
## -0.034
Interpretation: the correlation between height and pulse is not significant for either level of Sex.
If you build a temporary dataset for an analysis, remember to apply !is.na to the variables you are analyzing in that temporary dataset.
Set-up:
library (ggplot2)
library (MASS) # loaded for consistency with earlier sections; pairs() itself is in base graphics
# Load mtcars data and look at the documentation
data(mtcars)
help(mtcars)
Research question: I am interested in what characteristics of a car are associated with lower gas mileage.
Syntax:
summary(lm(DV ~ IV, data=title_of_data_set))
car.wt.lm <- lm(mpg ~ wt, data=mtcars)
summary(car.wt.lm)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.543 -2.365 -0.125 1.410 6.873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.285 1.878 19.86 < 2e-16 ***
## wt -5.344 0.559 -9.56 1.3e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3 on 30 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.745
## F-statistic: 91.4 on 1 and 30 DF, p-value: 1.29e-10
From the coefficient on wt we know that, for every one-unit increase in weight, mpg goes down by 5.344 units. To visualize the regression:
ggplot (mtcars, aes(x = wt, y = mpg)) +
geom_point (shape = 1) +
geom_smooth (method = lm, se = FALSE) # remove ", se = FALSE" to show 95% confidence region
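Once the model is fit, predict() gives the fitted mpg for any weight; here for a hypothetical 3,000 lb car (wt in mtcars is in units of 1,000 lbs):

```r
# Fitted value: 37.285 + (-5.344 * 3), about 21.3 mpg
predict(car.wt.lm, newdata = data.frame(wt = 3))
```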
am is a binary categorical variable:
car.trans.lm <- lm(mpg ~ am, data=mtcars)
summary(car.trans.lm)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.392 -3.092 -0.297 3.244 9.508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.15 1.12 15.25 1.1e-15 ***
## am 7.24 1.76 4.11 0.00029 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared: 0.36, Adjusted R-squared: 0.338
## F-statistic: 16.9 on 1 and 30 DF, p-value: 0.000285
From the coefficient on am we know that, going from level "0" to level "1", mpg goes up by 7.24 units. To visualize the regression:
ggplot (mtcars, aes(x = am, y = mpg)) +
stat_summary(fun.y="mean", geom="bar", color = "red", fill = "pink") +
stat_summary(fun.data = mean_cl_normal, aes(width = 0.2), geom="errorbar", mult = 1.96, color = "red") + # uses Hmisc
geom_smooth (method = lm, aes (group = 1), se = FALSE, color = "red")
pairs (mtcars[,c("mpg", "wt", "am")])
car.mult.lm <- lm(mpg ~ wt + am, data=mtcars)
summary(car.mult.lm)
##
## Call:
## lm(formula = mpg ~ wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.530 -2.362 -0.132 1.403 6.878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.3216 3.0546 12.22 5.8e-13 ***
## wt -5.3528 0.7882 -6.79 1.9e-07 ***
## am -0.0236 1.5456 -0.02 0.99
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.1 on 29 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.736
## F-statistic: 44.2 on 2 and 29 DF, p-value: 1.58e-09
OR
car.mult.lm2 <- lm(mpg ~ wt + am + hp, data=mtcars)
summary(car.mult.lm2)
##
## Call:
## lm(formula = mpg ~ wt + am + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.422 -1.792 -0.379 1.225 5.532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.00288 2.64266 12.87 2.8e-13 ***
## wt -2.87858 0.90497 -3.18 0.00357 **
## am 2.08371 1.37642 1.51 0.14127
## hp -0.03748 0.00961 -3.90 0.00055 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.5 on 28 degrees of freedom
## Multiple R-squared: 0.84, Adjusted R-squared: 0.823
## F-statistic: 49 on 3 and 28 DF, p-value: 2.91e-11
To compare two multivariate models, we do the following operations. The model with the smaller AIC (here, car.mult.lm2) is the ‘better’ model.
Syntax:
anova(car.mult.lm, car.mult.lm2, test = "Chisq")
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + am
## Model 2: mpg ~ wt + am + hp
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 29 278
## 2 28 180 1 98 9.5e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(car.mult.lm)
## [1] 168
AIC(car.mult.lm2)
## [1] 156
This tells us that it was good to have added hp to our model.
Another question: was it good to have added am? (am was non-significant in the coefficient table; model comparison conveys that same overall information.)
anova(car.wt.lm, car.mult.lm, test = "Chisq")
## Analysis of Variance Table
##
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + am
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 30 278
## 2 29 278 1 0.00224 0.99
Interpretation (answer): No. This comparison is not significant, so the two models are not different from one another; choose the one with the smaller number of predictors.
In this section we use infertility data to demonstrate the use and interpretation of logistic regression.
Load the data:
data(infert)
help(infert)
infert$spon.bin <- 1
infert$spon.bin[infert$spontaneous==0] <- 0
Is age associated with infertility?
We can't use a linear regression because there is no linear relationship between infertility and age: infertility (spon.bin) is a binary variable after our previous operation. We can visualize the relationship with a plot:
ggplot(infert, aes(x = age, y = spon.bin)) +
geom_point (shape = 1) +
stat_smooth(method="glm", method.args = list(family="binomial"), se = FALSE)
Syntax:
name_of_logistic_regression <- glm(DV ~ IV, data = name_of_data_frame, family = "binomial")
infert.glm1 <- glm(spon.bin ~ age, data=infert, family="binomial")
summary(infert.glm1)
##
## Call:
## glm(formula = spon.bin ~ age, family = "binomial", data = infert)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.311 -1.072 -0.897 1.238 1.613
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.4871 0.7979 1.86 0.062 .
## age -0.0562 0.0252 -2.23 0.026 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 339.12 on 247 degrees of freedom
## Residual deviance: 334.01 on 246 degrees of freedom
## AIC: 338
##
## Number of Fisher Scoring iterations: 4
exp(coef(infert.glm1)) # exponentiated coefficients
## (Intercept) age
## 4.42 0.95
exp(confint(infert.glm1)) # 95% CI for exponentiated coefficients
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.94 21.62
## age 0.90 0.99
Discussion:
We exponentiate coef() to obtain odds ratios for our independent variables. Now add a second predictor:
infert.glm2 <- glm(spon.bin ~ age + induced, data=infert, family="binomial")
summary(infert.glm2)
##
## Call:
## glm(formula = spon.bin ~ age + induced, family = "binomial",
## data = infert)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.574 -1.052 -0.661 1.100 2.002
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3939 0.8632 2.77 0.0055 **
## age -0.0713 0.0265 -2.69 0.0071 **
## induced -0.8084 0.2012 -4.02 5.9e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 339.12 on 247 degrees of freedom
## Residual deviance: 315.81 on 245 degrees of freedom
## AIC: 321.8
##
## Number of Fisher Scoring iterations: 4
exp(coef(infert.glm2))
## (Intercept) age induced
## 10.96 0.93 0.45
exp(confint(infert.glm2))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 2.08 62.01
## age 0.88 0.98
## induced 0.30 0.65
Interpretation: Controlling for induced, a 1 year increase in age reduces the relative odds of an abortion by approximately 7%, OR = .93, 95% CI[0.88, 0.98], p = .007. An increase in the number of prior induced abortions (you’d probably want this variable to have been a factor instead) decreases the relative odds of an abortion by a factor of 0.45.
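The percent-change reading of each odds ratio can be computed directly (a sketch using the model above):

```r
# Odds ratios expressed as percent change in the odds
(exp(coef(infert.glm2)) - 1) * 100   # age: about -7%; induced: about -55%
```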
If you wanted to see if there is a different effect of age for different levels of induced abortion, you would use an interaction in your model.
Now add the categorical variable education:
summary (infert$education)
## 0-5yrs 6-11yrs 12+ yrs
## 12 120 116
infert.glm3 <- glm(spon.bin ~ age + induced + education, data=infert, family="binomial")
summary(infert.glm3)
##
## Call:
## glm(formula = spon.bin ~ age + induced + education, family = "binomial",
## data = infert)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.594 -1.054 -0.654 1.092 2.029
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.0087 1.2310 1.63 0.10
## age -0.0649 0.0279 -2.33 0.02 *
## induced -0.8112 0.2051 -3.95 7.7e-05 ***
## education6-11yrs 0.0926 0.7353 0.13 0.90
## education12+ yrs 0.2940 0.7427 0.40 0.69
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 339.12 on 247 degrees of freedom
## Residual deviance: 315.25 on 243 degrees of freedom
## AIC: 325.3
##
## Number of Fisher Scoring iterations: 4
exp(coef(infert.glm3))
## (Intercept) age induced education6-11yrs
## 7.45 0.94 0.44 1.10
## education12+ yrs
## 1.34
exp(confint(infert.glm3))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.64 83.57
## age 0.89 0.99
## induced 0.29 0.66
## education6-11yrs 0.28 5.45
## education12+ yrs 0.33 6.75
Neither 6-11 nor 12+ years of education differs significantly from 0-5 years of education: the 95% confidence intervals for the odds ratios of 6-11 and 12+ years of education include the value 1, which indicates they are not significant.
infert$case <- factor(infert$case, levels = c(0,1), labels = c("control", "case")) # from codebook
infert.glm4 <- glm(spon.bin ~ case, data=infert, family="binomial")
summary(infert.glm4)
##
## Call:
## glm(formula = spon.bin ~ case, family = "binomial", data = infert)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.474 -0.870 -0.870 0.907 1.520
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.776 0.168 -4.63 3.6e-06 ***
## casecase 1.451 0.286 5.07 4.0e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 339.12 on 247 degrees of freedom
## Residual deviance: 311.76 on 246 degrees of freedom
## AIC: 315.8
##
## Number of Fisher Scoring iterations: 4
exp(coef(infert.glm4))
## (Intercept) casecase
## 0.46 4.27
exp(confint(infert.glm4))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.33 0.64
## casecase 2.46 7.56
anova (infert.glm1, infert.glm2, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: spon.bin ~ age
## Model 2: spon.bin ~ age + induced
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 246 334
## 2 245 316 1 18.2 2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(infert.glm1)
## [1] 338
AIC(infert.glm2)
## [1] 322
Interpretation: extremely small p-value tells us that there is a significant difference between the two models, and the model with the smaller AIC (here, infert.glm2) is the ‘better’ model.
This tells us that it was good to have added induced to our model
anova (infert.glm2, infert.glm1, test = "Chisq") # compare two models that differ by only one term
## Analysis of Deviance Table
##
## Model 1: spon.bin ~ age + induced
## Model 2: spon.bin ~ age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 245 316
## 2 246 334 -1 -18.2 2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(infert.glm2)
## [1] 322
AIC(infert.glm3)
## [1] 325
Interpretation: adding education did not help improve the overall fit of our model (the AIC increased from 322 to 325).
From the Wikipedia page: a confounding variable (also confounding factor, a confound, or confounder) is an extraneous variable in a statistical model that correlates (directly or inversely) with both the dependent variable and the independent variable. A perceived relationship between an independent variable and a dependent variable that has been misestimated due to the failure to account for a confounding factor is termed a spurious relationship, and the presence of misestimation for this reason is termed omitted-variable bias.
Basic R set-up:
library(ggplot2)
data(mtcars)
mtcars$am <- as.factor(mtcars$am)
I am examining whether car weight is a confounder of the relationship between transmission type and gas mileage.
First we test whether the bivariate association is significant:
trans.lm <- lm(mpg ~ am, data=mtcars)
summary(trans.lm)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.392 -3.092 -0.297 3.244 9.508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.15 1.12 15.25 1.1e-15 ***
## am1 7.24 1.76 4.11 0.00029 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared: 0.36, Adjusted R-squared: 0.338
## F-statistic: 16.9 on 1 and 30 DF, p-value: 0.000285
trans.wt.lm <- lm(mpg ~ am + wt, data=mtcars)
summary(trans.wt.lm)
##
## Call:
## lm(formula = mpg ~ am + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.530 -2.362 -0.132 1.403 6.878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.3216 3.0546 12.22 5.8e-13 ***
## am1 -0.0236 1.5456 -0.02 0.99
## wt -5.3528 0.7882 -6.79 1.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.1 on 29 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.736
## F-statistic: 44.2 on 2 and 29 DF, p-value: 1.58e-09
Interpretation: transmission type was associated with mpg in the bivariate case, but its coefficient is no longer significant once weight is added, so weight confounds the relationship between transmission type and mpg.
From Wikipedia: In statistics and regression analysis, moderation occurs when the relationship between two variables depends on a third variable. The third variable is referred to as the moderator variable or simply the moderator. The effect of a moderating variable is characterized statistically as an interaction; that is, a qualitative (e.g., sex, race, class) or quantitative (e.g., level of reward) variable that affects the direction and/or strength of the relation between dependent and independent variables. Specifically within a correlational analysis framework, a moderator is a third variable that affects the zero-order correlation between two other variables, or the value of the slope of the dependent variable on the independent variable. In analysis of variance (ANOVA) terms, a basic moderator effect can be represented as an interaction between a focal independent variable and a factor that specifies the appropriate conditions for its operation.
As opposed to the confounding case, the term added to our model is not an existing variable but an interaction term that we create to match our research interest.
We are here examining gas mileage as a function of transmission type, weight, and the interaction between transmission type and weight.
Syntax:
cars.mod.lm <- lm(mpg ~ am * wt, data=mtcars)
summary(cars.mod.lm)
##
## Call:
## lm(formula = mpg ~ am * wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.600 -1.545 -0.533 0.901 6.091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.416 3.020 10.40 4.0e-11 ***
## am1 14.878 4.264 3.49 0.0016 **
## wt -3.786 0.786 -4.82 4.6e-05 ***
## am1:wt -5.298 1.445 -3.67 0.0010 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.6 on 28 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.815
## F-statistic: 46.6 on 3 and 28 DF, p-value: 5.21e-11
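Because the interaction is significant, the slope of wt differs by transmission type; the simple slopes can be read off the coefficients (a sketch):

```r
b <- coef(cars.mod.lm)
b["wt"]                 # slope of wt when am = 0: about -3.79
b["wt"] + b["am1:wt"]   # slope of wt when am = 1: about -9.08
```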
ggplot (mtcars, aes(x = wt, y = mpg, group = am, color = am)) +
# put the categorical moderator in the group (and color) placeholder
geom_point (shape = 1) +
geom_smooth (method = lm, se = FALSE)